README for the LODifier evaluation data
=======================================
Isabelle Augenstein, Sebastian Pado, Sebastian Rudolph
LODifier Website: http://www.aifb.kit.edu/web/LODifier
Dec 2011/Jan 2012

This README describes the extraction procedure and file format for the
dataset for our quantitative evaluation. It was sampled from the TDT2
dataset [1, 2].

Extraction
----------
The extraction was based on three sources:

a. "tdt2_topic_rel.complete_annot.v3.3", a file which forms part of
   the TDT2 distribution and maps articles from the TDT2 dataset onto
   topics.

b. The "topic list" for the training set on the web page [3], which
   we used to determine the "seed" document for each topic.

c. The names of the TDT2 documents, which indicate the news outlet
   each document comes from and thus its language and channel
   (NYT, APW: English newspaper; PRI, VOA_EN: English broadcast news).
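
The outlet-to-channel mapping in (c) can be sketched as a small lookup.
This is an illustration only: the function name is hypothetical, and it
assumes that the outlet tag appears somewhere inside the document name,
which this README does not specify.

```python
# Outlet tags and their channels, as listed under source (c).
OUTLET_CHANNEL = {
    "NYT": "newspaper",
    "APW": "newspaper",
    "PRI": "broadcast",
    "VOA_EN": "broadcast",
}


def channel_of(doc_name):
    """Return "newspaper" or "broadcast", or None for any other outlet.

    Assumption: the outlet tag occurs as a substring of the document name.
    """
    for tag, channel in OUTLET_CHANNEL.items():
        if tag in doc_name:
            return channel
    return None
```

Documents for which this returns None would be excluded from the
dataset under the procedure described below.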

As described in the paper, we first determined the seed document for
each topic from (b). Then we filtered this list to obtain the set of
topics whose seed document was either a newspaper or a broadcast news
document, using (c).

To obtain positive pairs, we then paired each seed document with up to
fifty other newspaper or broadcast news "test documents" from the same
topic, using (a) and (c).
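
The filtering and pairing steps above might look roughly as follows.
This is a sketch and not the original extraction script: the function
and parameter names are hypothetical, the inputs are assumed to be
already parsed from (a) and (b), and taking the first fifty candidates
in input order is an assumption, since this README does not say how the
fifty were chosen.

```python
MAX_TEST_DOCS = 50  # "up to fifty" test documents per seed


def build_positive_pairs(seeds, topic_docs, is_news, limit=MAX_TEST_DOCS):
    """Pair each usable seed with up to `limit` documents of its topic.

    seeds:      dict mapping topic ID -> seed document ID, from (b)
    topic_docs: dict mapping topic ID -> list of document IDs, from (a)
    is_news:    predicate that is true for newspaper or broadcast
                news documents, based on (c)
    """
    pairs = []
    for topic, seed in seeds.items():
        if not is_news(seed):
            continue  # drop topics whose seed is from another outlet
        candidates = [
            doc for doc in topic_docs.get(topic, [])
            if doc != seed and is_news(doc)
        ]
        # Assumption: keep the first `limit` candidates in input order.
        for doc in candidates[:limit]:
            pairs.append((seed, doc))
    return pairs
```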

Finally, the negative pairs were constructed by shuffling the positive
pairs, i.e. pairing each test document with the seed document of a
random different topic present in the positive pairs.
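
A minimal sketch of this shuffling step is given below. It assumes that
each topic has exactly one seed document, so that the seed ID
identifies the topic. The function name is hypothetical, and the sketch
does not check whether a test document happens to be annotated for the
randomly chosen topic as well.

```python
import random


def build_negative_pairs(positive_pairs, rng=None):
    """Re-pair each test document with the seed of a different topic.

    positive_pairs: list of (seed_id, test_id) tuples
    rng:            optional random.Random, for reproducible output
    """
    if rng is None:
        rng = random.Random()
    # Only seeds that occur in the positive pairs are eligible.
    seeds = sorted({seed for seed, _ in positive_pairs})
    if len(seeds) < 2:
        raise ValueError("need at least two topics to build negative pairs")
    negatives = []
    for seed, test in positive_pairs:
        other = rng.choice([s for s in seeds if s != seed])
        negatives.append((other, test))
    return negatives
```

Passing a seeded generator, for example random.Random(0), makes the
resulting negative pairs repeatable across runs.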

File format
-----------

The dataset contains two files, positive_pairs and negative_pairs.
Each of the two files consists of two columns, and each line corresponds 
to one document pair. The seed document ID is listed in the first column,
and the test document ID in the second column.
The dataset further contains RDF files for all seed and test documents.
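
The pair files can be read with a few lines of Python. The sketch below
assumes that the two columns are separated by whitespace (spaces or
tabs), which this README does not state explicitly. The function name
is hypothetical.

```python
def read_pairs(path):
    """Read a pair file into a list of (seed_id, test_id) tuples."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.split()  # assumption: whitespace-separated
            if not fields:
                continue  # skip blank lines
            if len(fields) != 2:
                raise ValueError(
                    f"{path}:{number}: expected 2 columns, got {len(fields)}"
                )
            pairs.append((fields[0], fields[1]))
    return pairs
```

Calling read_pairs("positive_pairs") and read_pairs("negative_pairs")
then yields the two lists of document pairs.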


References
----------

[1] Allan, J.: Introduction to topic detection and tracking,
pp. 1–16. Kluwer Academic Publishers, Norwell, MA, USA (2002)

[2] http://projects.ldc.upenn.edu/TDT2/

[3] http://www.ldc.upenn.edu/Projects/TDT2/training.html
